
Conversation

vjanfaza

The Compute-Context-Length (CCL) technique optimizes the throughput of large language models (LLMs) on Qualcomm devices when handling very large context lengths. The current Ahead-Of-Time (AOT) compilation on Qualcomm devices cannot predict how many tokens will actually be needed, leading to significant throughput drops during the prefill and decode phases, because the system performs attention calculations over the full compiled context length. To address this, we introduce Compute Context Length (CCL), an additional ONNX variable that allows dynamic context-length specialization. By generating tokens with smaller, more manageable context lengths (CCL), we reduce memory reads and attention computation, thereby improving throughput.
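
To illustrate the idea, here is a minimal PyTorch sketch, not the actual QEfficient implementation: the name `attention_with_ccl`, the `ccl` parameter, and the tensor layout are illustrative assumptions. During decode it reads only the first `min(cache_len, ccl)` cached positions instead of the full allocated context length, which is where the savings in memory reads and attention compute come from.

```python
import torch

def attention_with_ccl(query, k_cache, v_cache, cache_len, ccl):
    # Attend over only the first min(cache_len, ccl) cached positions
    # rather than the full allocated context length.
    effective_len = min(cache_len, ccl)
    k = k_cache[:, :, :effective_len, :]  # (batch, heads, effective_len, head_dim)
    v = v_cache[:, :, :effective_len, :]
    scores = query @ k.transpose(-1, -2) / (query.shape[-1] ** 0.5)
    probs = torch.softmax(scores, dim=-1)
    return probs @ v

# Example: a single decode-step query against a cache allocated for
# 4096 positions, with CCL capping the attention window at 512.
b, h, d = 1, 8, 64
q = torch.randn(b, h, 1, d)
k_cache = torch.randn(b, h, 4096, d)
v_cache = torch.randn(b, h, 4096, d)
out = attention_with_ccl(q, k_cache, v_cache, cache_len=300, ccl=512)
```

With `cache_len=300`, the sketch reads 300 KV positions rather than all 4096. In the actual technique, once generation grows past the current CCL the model would switch to the next, larger context-length specialization rather than truncating the cache; the sketch only shows the reduced read and compute pattern within one specialization.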

@vjanfaza vjanfaza reopened this Oct 8, 2025
vjanfaza and others added 16 commits October 7, 2025 17:13
quic#557)

Updated the run_vlm_kv_model_on_pytorch and run_vlm_kv_model_on_ort
methods to run with the latest dual QPC setup, along with the required
changes to the Input Handler of VLMs.

Also updated the way head_dim is calculated for past_key_value creation,
since certain models now provide an explicit head_dim in their config. We
fall back to the previous derivation if the parameter isn't found (see the
sketch after this commit message).

Signed-off-by: Dhiraj Kumar Sah <[email protected]>
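
A minimal sketch of the head_dim fallback described above, assuming a Hugging Face style config object; `get_head_dim` is a hypothetical helper name, not code from this PR:

```python
def get_head_dim(config):
    # Prefer an explicit head_dim if the model config provides one,
    # as certain newer models do.
    head_dim = getattr(config, "head_dim", None)
    if head_dim is None:
        # Fall back to the previous derivation from the hidden size
        # and the number of attention heads.
        head_dim = config.hidden_size // config.num_attention_heads
    return head_dim
```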
